Abstract

Let $n$ be an even positive integer and $F$ be the field GF(2). A word in $F^n$ is called
balanced if its Hamming weight is $n/2$. A subset $C \subseteq F^n$ is called a balancing set if for
every word $y \in F^n$ there is a word $x \in C$ such that $y + x$ is balanced. It is shown that
most linear subspaces of $F^n$ of dimension slightly larger than $\frac{3}{2} \log_2 n$ are balancing
sets. A generalization of this result to linear subspaces that are "almost balancing" is
also presented. On the other hand, it is shown that the problem of deciding whether a
given set of vectors in $F^n$ spans a balancing set is NP-hard. An application of linear
balancing sets is presented for designing efficient error-correcting coding schemes in
which the codewords are balanced.



1 Introduction 

Let F denote the finite field GF(2) and assume hereafter that n is an even positive integer. 
For words (vectors) $x$ and $y$ in $F^n$, denote by $w(x)$ the Hamming weight of $x$ and by $d(x, y)$
the Hamming distance between $x$ and $y$.

* This work was done while visiting the Information Theory Research Group at Hewlett-Packard Laboratories, Palo Alto, CA 94304, USA.

We say that a word $z \in F^n$ is balanced if $w(z) = n/2$. For a word $x \in F^n$, define the set

$$B(x) = \{x + z : z \text{ is balanced}\} = \{y \in F^n : d(y, x) = n/2\} .$$

In particular, if $0$ denotes the all-zero word in $F^n$, then $B(0)$ is the set of all balanced words
in $F^n$. It is known that

$$\frac{2^n}{\sqrt{2n}} \le \binom{n}{n/2} = |B(x)| \le \frac{2^n}{\sqrt{\pi n/2}} \qquad (1)$$

(see, for example, [10, p. 309]). We extend the notation $B(\cdot)$ to subsets $C \subseteq F^n$ by

$$B(C) = \bigcup_{x \in C} B(x) .$$

A subset $C \subseteq F^n$ is called a balancing set if $B(C) = F^n$; equivalently, $C$ is a balancing set
if for every $y \in F^n$ there exists $x \in C$ such that $d(y, x) = w(y + x) = n/2$ (which is also
the same as saying that for every $y \in F^n$ one has $B(y) \cap C \ne \emptyset$). Using the terminology of
Cohen et al. in [6, §13.1], a balancing set can also be referred to as an $\{n/2\}$-covering code.
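
The two bounds in (1) can be checked numerically for small even $n$. The following is a minimal sketch, not part of the original argument; the function name `central_binomial_bounds` is our own choice, and floating-point arithmetic limits it to moderate values of $n$.

```python
from math import comb, pi, sqrt

def central_binomial_bounds(n):
    """Return (lower, exact, upper) for |B(x)| = C(n, n/2), following (1)."""
    assert n > 0 and n % 2 == 0, "n must be an even positive integer"
    lower = 2 ** n / sqrt(2 * n)        # left-hand side of (1)
    exact = comb(n, n // 2)             # number of balanced words
    upper = 2 ** n / sqrt(pi * n / 2)   # right-hand side of (1)
    return lower, exact, upper

for n in (2, 8, 32, 128):
    lower, exact, upper = central_binomial_bounds(n)
    print(n, lower <= exact <= upper)
```

For $n = 2$ the lower bound is met with equality, since $2^2/\sqrt{4} = 2 = \binom{2}{1}$.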

An example of a balancing set of size $n$ was presented by Knuth in [9]: his set consists
of the words $x_1, x_2, \ldots, x_n$, where

$$x_i = \underbrace{1 1 \ldots 1}_{i} \, \underbrace{0 0 \ldots 0}_{n-i} .$$

It was shown by Alon et al. in [1] that every balancing set must contain at least $n$ words;
hence, Knuth's balancing set has the smallest possible size.
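
For small $n$, the balancing property of Knuth's set can be confirmed by exhaustive search. The sketch below is only an illustration under our own naming (`knuth_set`, `is_balancing_set`); it enumerates all $2^n$ words and is therefore practical only for small $n$.

```python
from itertools import product

def knuth_set(n):
    """Knuth's words x_i = 1^i 0^(n-i) for i = 1, ..., n, as tuples of bits."""
    return [tuple([1] * i + [0] * (n - i)) for i in range(1, n + 1)]

def distance(y, x):
    """Hamming distance between two equal-length bit tuples."""
    return sum(a ^ b for a, b in zip(y, x))

def is_balancing_set(code, n):
    """True if every y in F^n is at distance exactly n/2 from some word of code."""
    return all(
        any(distance(y, x) == n // 2 for x in code)
        for y in product((0, 1), repeat=n)
    )

print(is_balancing_set(knuth_set(6), 6))
```

Dropping any word from `knuth_set(6)` leaves fewer than $n$ words, so by the bound of Alon et al. the reduced set can no longer be balancing.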

As proposed by Knuth, balancing sets can be used to efficiently encode unconstrained
binary words into balanced words as follows: given an information word $u \in F^n$, a word $x$
in a balancing set $C$ is found so that $u + x$ is balanced. The transmitted codeword then
consists of $u + x$, appended by a recursive encoding of the index (of length $\lceil \log_2 |C| \rceil$) of $x$
within $C$. Thus, when $|C| = n$, the redundancy of the transmission is $(\log_2 n) + O(\log \log n)$.
By (1), we can get a smaller redundancy of $\frac{1}{2}(\log_2 n) + O(1)$ using any one-to-one mapping
into $B(0)$. Such a mapping, in turn, can be implemented using enumerative coding, but the
overall time complexity will be higher than that of Knuth's encoder.
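
The balancing step of this encoder can be sketched as follows, using Knuth's prefix words as the balancing set. This is an illustrative simplification: the names `knuth_encode` and `knuth_decode` are ours, and the index is returned as a plain integer rather than being recursively encoded into a balanced suffix as the scheme above requires.

```python
from itertools import product

def knuth_encode(u):
    """Flip the shortest prefix of u (a list of bits, even length) that balances it.

    Returns (balanced_word, i), where i is the length of the flipped prefix,
    i.e. the index of the word x_i = 1^i 0^(n-i) that was added to u.
    """
    n = len(u)
    for i in range(1, n + 1):
        candidate = [b ^ 1 for b in u[:i]] + list(u[i:])
        if sum(candidate) == n // 2:
            return candidate, i
    raise ValueError("no balancing prefix found; is the length even?")

def knuth_decode(word, i):
    """Undo the encoding by flipping the first i bits back."""
    return [b ^ 1 for b in word[:i]] + list(word[i:])

# Round trip over all words of length 6.
ok = all(
    knuth_decode(*knuth_encode(list(u))) == list(u)
    for u in product((0, 1), repeat=6)
)
print(ok)
```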

In many applications, the transmitted codewords are not only required to be balanced,
but also to have some Hamming distance properties so as to provide error-correction capabilities.
Placing an error-correcting encoder before applying either of the two balancing encoders
mentioned earlier will generally not work, since the balancing encoder may destroy any
distance properties of its input. One possible solution would then be to encode the raw
information word directly into a codeword of a constant-weight error-correcting code, in which
all codewords are in $B(0)$. By a simple averaging argument one gets that for every code
$C \subseteq F^n$ there is at least one word $x \in F^n$ for which the shifted set

$$C + x = \{y \in F^n : y - x \in C\}$$

contains at least $\left(\binom{n}{n/2}/2^n\right)|C| \ge |C|/\sqrt{2n}$ balanced words. Yet, for most known
constant-weight codes, the implementation of an encoder is typically quite complex
compared to the encoding of linear codes or to the above-mentioned balancing methods [12].

In this work, we will be interested in linear balancing sets, namely, balancing sets that
are linear subspaces of $F^n$. Our main result, to be presented in Section 3, states that most
linear subspaces of $F^n$ whose dimension is at a (small) margin above $\frac{3}{2} \log_2 n$ are linear
balancing sets. A generalization of this result to sets which are "almost balancing" (in a
sense to be formally defined) will be presented in Section 4. On the other hand, we will
prove (in Appendix B) that the problem of deciding whether a given set of vectors in $F^n$
spans a balancing set is NP-hard.

Our study of balancing sets was motivated by the potential application of these sets in
obtaining efficient coding schemes that combine balancing and error correction, as we outline
in Section 5. However, we feel that linear balancing sets could be interesting also in their
own right, from a purely combinatorial point of view.

2 Existence result 

From the result in [1] we readily get the following lower bound on the dimension of any linear
balancing set.

Theorem 2.1. [1] The dimension of every linear balancing set $C \subseteq F^n$ is at least $\lceil \log_2 n \rceil$.

As mentioned earlier, we will show that most linear subspaces of $F^n$ of dimension slightly
above $\frac{3}{2} \log_2 n$ are in fact balancing sets. We start with the following simpler existence result,
as some components of its proof (in particular, Lemma 2.3 below) will be useful also for our
random-coding result.

Theorem 2.2. There exists a linear balancing set in $F^n$ of dimension $\lceil \frac{3}{2} \log_2 n \rceil$.

Theorem 2.2 can be seen as the balancing-set counterpart of the result of Goblick [8]
regarding the existence of good linear covering codes (see also Berger [2, pp. 201-202],
Cohen [5], Cohen et al. [6, §12.3], and Delsarte and Piret [7]); in fact, our proof is strongly
based on their technique. In what follows, we will adopt the formulation of [7].

Before proving Theorem 2.2 we introduce some notation. We denote the union $C \cup (C + x)$
by $C + Fx$. (When $C$ is a linear subspace of $F^n$ then so is $C + Fx$, and $C + x$ is a coset of $C$
within $F^n$.)

We also define

$$Q(C) = 2^{-n} \, |F^n \setminus B(C)| = 1 - \frac{|B(C)|}{2^n} .$$

Namely, $Q(C)$ is the probability that $B(x) \cap C = \emptyset$, for a randomly and uniformly selected
word $x \in F^n$.

The proof of Theorem 2.2 makes use of the following lemma.

Lemma 2.3. For every subset $C \subseteq F^n$,

$$2^{-n} \sum_{x \in F^n} Q(C + Fx) = (Q(C))^2 .$$

Proof. The proof is essentially the first part of the proof of Theorem 3 in [7], except
that we replace the Hamming sphere by $B(\cdot)$. For the sake of completeness, we include the
proof in Appendix A. □
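
Since Lemma 2.3 holds for every subset of $F^n$, it can be sanity-checked exhaustively for small $n$. The sketch below is our own illustration (words are represented as integers, and the helper names are assumptions); it uses exact rational arithmetic so that the two sides can be compared for equality.

```python
from fractions import Fraction
from itertools import combinations

def balanced_words(n):
    """All words of F^n with Hamming weight n/2, encoded as integers."""
    return [sum(1 << i for i in ones) for ones in combinations(range(n), n // 2)]

def Q(code, n):
    """Q(C) = 1 - |B(C)| / 2^n as an exact fraction."""
    covered = {c ^ z for c in code for z in balanced_words(n)}
    return Fraction(2 ** n - len(covered), 2 ** n)

def lemma_2_3_sides(code, n):
    """Return (left-hand side, right-hand side) of the identity in Lemma 2.3."""
    code = set(code)
    total = sum(Q(code | {c ^ x for c in code}, n) for x in range(2 ** n))
    return total / 2 ** n, Q(code, n) ** 2

lhs, rhs = lemma_2_3_sides({0b0000, 0b0111, 0b1010}, 4)
print(lhs == rhs)
```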

Proof of Theorem 2.2. Again, we follow the steps of the proof of Theorem 3 in [7].
Write $\ell = \lceil \frac{3}{2} \log_2 n \rceil$. We construct iteratively linear subspaces $C_0 \subseteq C_1 \subseteq \cdots \subseteq C_\ell$ as follows.
The subspace $C_0$ is simply $\{0\}$. Given now the subspace $C_{i-1}$, we let

$$C_i = C_{i-1} + F x_i ,$$

where $x_i$ is a word in $F^n$ such that

$$Q(C_{i-1} + F x_i) \le (Q(C_{i-1}))^2 ;$$

by Lemma 2.3 such a word indeed exists. Now,

$$Q(C_0) = 1 - \frac{1}{2^n} \binom{n}{n/2} \le 1 - \frac{1}{\sqrt{2n}} , \qquad (2)$$

where the last step follows from the lower bound in (1). Hence,

$$Q(C_\ell) \le (Q(C_0))^{2^\ell} \le \left( 1 - \frac{1}{\sqrt{2n}} \right)^{n^{3/2}} < e^{-n/\sqrt{2}} < 2^{-n} .$$

As $2^n Q(C_\ell)$ is an integer, we conclude that $Q(C_\ell)$ is necessarily zero, namely, $B(C_\ell) = F^n$. □
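
The iterative construction in this proof can be turned into a brute-force search for small $n$. The sketch below is an illustration under our own assumptions: it picks, at each step, a word $x$ that minimizes $Q(C + Fx)$ (which in particular satisfies the inequality required in the proof), it represents words as integers, and it stops early once the set is already balancing. Its running time is exponential in $n$.

```python
from itertools import combinations
from math import ceil, log2

def balanced_words(n):
    """All words of F^n with Hamming weight n/2, encoded as integers."""
    return [sum(1 << i for i in ones) for ones in combinations(range(n), n // 2)]

def greedy_linear_balancing_set(n):
    """Return a basis (list of integers) of a linear balancing set in F^n."""
    covered = set(balanced_words(n))          # B(C_0) for C_0 = {0}
    basis = []
    for _ in range(ceil(1.5 * log2(n))):
        if len(covered) == 2 ** n:            # Q(C_i) = 0: already balancing
            break
        # B(C + Fx) = B(C) union (B(C) + x); choose x covering the most words.
        best = max(range(2 ** n),
                   key=lambda x: len(covered | {b ^ x for b in covered}))
        basis.append(best)
        covered |= {b ^ best for b in covered}
    assert len(covered) == 2 ** n             # guaranteed by Theorem 2.2
    return basis

basis = greedy_linear_balancing_set(6)
print(len(basis), [format(b, "06b") for b in basis])
```

For $n = 6$ the proof guarantees at most $\lceil \frac{3}{2} \log_2 6 \rceil = 4$ basis words, while Theorem 2.1 requires at least $\lceil \log_2 6 \rceil = 3$.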

3 Most linear subspaces are balancing sets 

The next theorem is our main result. Hereafter, $\mathbb{N}$ stands for the set of natural numbers, and
the notation $\exp(z)$ stands for an expression of the form $a \cdot 2^{bz}$, for some positive constants
$a$ and $b$.

Theorem 3.1. Given a function $\rho : (2\mathbb{N}) \to \mathbb{N}$, let $C$ be a random linear subspace of $F^n$
which is spanned by $\lceil \frac{3}{2} \log_2 n \rceil + \rho(n)$ words that are selected independently and uniformly
from $F^n$. Then,

$$\mathrm{Prob}\{ C \text{ is a balancing set} \} \ge 1 - \exp(-\rho(n)) .$$

(Thus, as long as $\rho(n)$ goes to infinity with $n$, all but a vanishing fraction of the ensemble
of linear subspaces of $F^n$ of dimension $\lceil \frac{3}{2} \log_2 n \rceil + \rho(n)$ are balancing sets.)
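
The random ensemble in Theorem 3.1 can be explored empirically for small $n$. The following Monte Carlo sketch is only an illustration with hypothetical names and parameters (`estimate_balancing_probability`, `trials`, `seed`); it estimates the fraction of random spans that are balancing sets and makes no claim about the constants hidden in $\exp(\cdot)$.

```python
import random
from itertools import combinations

def span(generators):
    """Linear span over GF(2) of a list of words encoded as integers."""
    code = {0}
    for g in generators:
        code |= {c ^ g for c in code}
    return code

def is_balancing(code, n):
    """True if B(code) is all of F^n."""
    balanced = [sum(1 << i for i in ones) for ones in combinations(range(n), n // 2)]
    return len({c ^ z for c in code for z in balanced}) == 2 ** n

def estimate_balancing_probability(n, num_generators, trials=200, seed=0):
    """Fraction of trials in which num_generators uniform words span a balancing set."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        generators = [rng.getrandbits(n) for _ in range(num_generators)]
        hits += is_balancing(span(generators), n)
    return hits / trials

for k in (3, 4, 5, 6):
    print(k, estimate_balancing_probability(8, k))
```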

Theorem 3.1 is the balancing-set counterpart of a result originally obtained by Blinovskii [3],
showing that most linear codes attain the sphere-covering bound. An alternate
proof for his result (with slightly different convergence rates as $n \to \infty$) was then presented
by Cohen et al. in [6, §12.3]. The proof that we provide for Theorem 3.1 can be seen as an
adaptation (and refinement) of the proof of Cohen et al. to the balancing-set setting.

We break the proof of Theorem 3.1 into three lemmas. To maintain the flow of the
exposition, we will defer the proofs of the lemmas until after the proof of Theorem 3.1.

Lemma 3.2. Let $C_0$ be a random linear subspace of $F^n$ which is spanned by $\lceil \frac{1}{2} \log_2 n \rceil$
random words that are selected independently and uniformly from $F^n$. There exists an absolute
constant $\beta \in [0, 1)$ independent of $n$ (e.g., $\beta = \frac{3}{4}$) such that

$$\mathrm{Prob}\{ Q(C_0) > \beta \} \le \exp(-n) .$$

Lemma 3.3. Let $C_0$ be a linear subspace of $F^n$. Fix a positive integer $r$, and let $C_1$ be a
random linear subspace of $F^n$ which is spanned by $C_0$ and $r$ random words from $F^n$ that are
selected uniformly and independently. Then

$$\mathrm{Prob}\{ Q(C_1) > (Q(C_0))^{(r/2)+1} \} \le (Q(C_0))^{r/2} .$$

Lemma 3.4. Let $C_1$ be a linear subspace of $F^n$ and let $C_2$ be a random linear subspace
of $F^n$ which is spanned by $C_1$ and $\lceil \log_2 n \rceil$ random words from $F^n$ that are selected uniformly
and independently. Then

$$\mathrm{Prob}\{ Q(C_2) > 0 \} \le 8 Q(C_1) .$$

Proof of Theorem 3.1. It is known (e.g., from [10, p. 444, Theorem 9]) that

$$\mathrm{Prob}\{ C \ne F^n \} \le \exp(n - \rho(n)) .$$

Hence, we can assume hereafter in the proof that $\rho(n)$ is at most linear in $n$.

Let $\mathcal{U}$ be the list of $|\mathcal{U}| = \lceil \frac{3}{2} \log_2 n \rceil + \rho(n)$ random words from $F^n$ that span $C$, and
write $\ell = \lceil \frac{1}{2} \log_2 n \rceil$, $t = \lceil \log_2 n \rceil$, and $r = |\mathcal{U}| - \ell - t$. We partition the words in $\mathcal{U}$ into three
sub-lists, $\mathcal{U}_0$, $\mathcal{U}_1$, and $\mathcal{U}_2$, of sizes $\ell$, $r$, and $t$, respectively. We denote by $C_0$, $C_1$, and $C_2$ the
linear spans of $\mathcal{U}_0$, $\mathcal{U}_0 \cup \mathcal{U}_1$, and $\mathcal{U}_0 \cup \mathcal{U}_1 \cup \mathcal{U}_2$, respectively.

Take $\beta = \frac{3}{4}$ (say). By Lemma 3.2 we get that

$$\mathrm{Prob}\{ Q(C_0) > \beta \} \le \exp(-n) . \qquad (3)$$

By Lemma 3.3 we have

$$\mathrm{Prob}\{ Q(C_1) > \beta^{(r/2)+1} \mid Q(C_0) \le \beta \} \le \beta^{r/2} . \qquad (4)$$

Finally, by Lemma 3.4 we get

$$\mathrm{Prob}\{ Q(C_2) > 0 \mid Q(C_1) \le \beta^{(r/2)+1} \} \le (8\beta) \cdot \beta^{r/2} . \qquad (5)$$

The result is now obtained by combining (3)-(5) and noting that $\beta^{r/2} = \exp(-\rho(n))$. □

Next, we turn to the proofs of the lemmas.

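Proof of Lemma 3.2. Write $\ell = \lceil \frac{1}{2} \log_2 n \rceil$, and let $x_1, x_2, \ldots, x_\ell$ denote the random
words that span $C_0$. The proof is based on the fact that, with high probability, the Hamming
weight of each nonzero word in $C_0$ is close to $n/2$. Indeed, fix some nonzero vector $(a_i)_{i=1}^{\ell}$ in
$F^\ell$. Then the sum $x = \sum_{i=1}^{\ell} a_i x_i$ is uniformly distributed over $F^n$ and, so, by the Chernoff
bound, for every $\delta > 0$ there exists $\eta = \eta(\delta) > 0$ such that

$$\mathrm{Prob}\left\{ \left| w(x) - \tfrac{n}{2} \right| > \delta n \right\} \le 2^{-\eta n} .$$

Given some $\delta \in [0, \frac{1}{2})$, let $\mathcal{E}$ denote the event that $C_0$ has dimension (exactly) $\ell$ and each
nonzero word in $C_0$ has Hamming weight within $(\frac{1}{2} \pm \delta) n$; namely,

$$\mathcal{E} = \left\{ \left| w(x) - \tfrac{n}{2} \right| \le \delta n \ \text{ for every } x = \sum_{i=1}^{\ell} a_i x_i \text{ where } (a_i)_{i=1}^{\ell} \in F^\ell \setminus \{0\} \right\} .$$

By the union bound we readily get that

$$\mathrm{Prob}\{\mathcal{E}\} \ge 1 - 2^\ell \cdot 2^{-\eta n} = 1 - \exp(-n) .$$

Let $x$ and $x'$ be two distinct words in $C_0$, write $d(x, x') = \tau n$, and suppose that $\frac{1}{2} - \delta \le
\tau \le \frac{1}{2} + \delta$. If $\tau n$ is odd then $|B(x) \cap B(x')| = 0$. Otherwise,

$$|B(x) \cap B(x')| = \binom{\tau n}{\tau n/2} \binom{(1-\tau) n}{(1-\tau) n/2} \le \frac{2^{\tau n}}{\sqrt{\pi \tau n/2}} \cdot \frac{2^{(1-\tau) n}}{\sqrt{\pi (1-\tau) n/2}} = \frac{2^{n+1}}{\pi n \sqrt{\tau (1-\tau)}} \le \frac{2^{n+2}}{\pi n \sqrt{1 - 4 \delta^2}} , \qquad (6)$$

where the second step follows from the upper bound in (1).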



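Conditioning on the event $\mathcal{E}$, we get by de Caen's lower bound [4] that

$$|B(C_0)| \ge \sum_{x \in C_0} \frac{|B(x)|^2}{\sum_{x' \in C_0} |B(x) \cap B(x')|} \ge \frac{2^\ell \binom{n}{n/2}^2}{\binom{n}{n/2} + 2^\ell \cdot \frac{2^{n+2}}{\pi n \sqrt{1 - 4 \delta^2}}} \ge \frac{2^n}{\frac{\sqrt{2n}}{2^\ell} + \frac{8}{\pi \sqrt{1 - 4 \delta^2}}} , \qquad (7)$$

where in the last step we have used the lower bound in (1). On the other hand, we also have
$2^\ell \ge \sqrt{n}$ and, so, writing

$$\beta(\delta) = 1 - \left( \sqrt{2} + \frac{8}{\pi \sqrt{1 - 4 \delta^2}} \right)^{-1} ,$$

we get that, conditioned on the event $\mathcal{E}$,

$$Q(C_0) = 1 - \frac{|B(C_0)|}{2^n} \le \beta(\delta) . \qquad (8)$$

The result follows by recalling that $\mathrm{Prob}\{\mathcal{E}\} \ge 1 - \exp(-n)$ and observing that $\beta(\delta) < 1$ for
every $\delta \in [0, \frac{1}{2})$ (in particular, there is some $\delta$ for which $\beta(\delta) = \frac{3}{4} > \beta(0)$). □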

Remark 3.1. Suppose that $C_0(m, \ell)$ is an $\ell$-dimensional linear subspace of the linear
$[n{=}2^m, m, 2^{m-1}]$ code over $F$ obtained by appending a fixed zero coordinate to every codeword
of the binary $[2^m - 1, m, 2^{m-1}]$ simplex code. In this case, we can substitute $\delta = 0$ in (8) and
obtain that $Q(C_0(m, \ell)) \le \beta(0) \approx 0.748$, for every $\ell$ in the range $m/2 \le \ell \le m$. Thus,
$C_0(m, \ell)$ can replace the random code $C_0$ in Lemma 3.2. If $\ell$ grows sufficiently fast with $m$
so that $\ell - (m/2)$ tends to infinity, then from (7) it follows that

$$\lim_{m,\ \ell - (m/2) \to \infty} Q(C_0(m, \ell)) \le 1 - \frac{\pi}{8} \approx 0.607 .$$

Let $C_0' = C_0'(m, \ell)$ be given by $C_0(m, \ell) + Fx$, where $x$ is an odd-weight word in $F^n$. For
$m > 1$ we have $|B(C_0')| = 2 |B(C_0(m, \ell))|$. Therefore, when $m,\ \ell - (m/2) \to \infty$, we can bound
$Q(C_0')$ from above by $1 - (\pi/4) \approx 0.215$. □



Proof of Lemma 3.3. Let $x_1, x_2, \ldots, x_r$ be the random words that, together with $C_0$,
span (the random code) $C_1$. Obviously, $B(C_0 + x_i) \subseteq B(C_1)$ and $Q(C_0 + x_i) = Q(C_0)$ for every
$i = 1, 2, \ldots, r$. Hence, the expected value of $Q(C_1)$ (taken over all the independently and
uniformly distributed words $x_1, x_2, \ldots, x_r \in F^n$) satisfies

$$\mathrm{E}\{Q(C_1)\} = 2^{-n} \sum_{y \in F^n} \mathrm{Prob}\{ y \notin B(C_1) \} \le 2^{-n} \sum_{y \in F^n \setminus B(C_0)} \prod_{i=1}^{r} \mathrm{Prob}\{ y \notin B(C_0 + x_i) \} = (Q(C_0))^{r+1} .$$

Therefore,

$$\mathrm{Prob}\{ Q(C_1) > (Q(C_0))^{(r/2)+1} \} \le \mathrm{Prob}\{ Q(C_1) > (Q(C_0))^{-r/2} \, \mathrm{E}\{Q(C_1)\} \} \le (Q(C_0))^{r/2} ,$$

where the last step follows from Markov's inequality. □

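Proof of Lemma 3.4. The result is obvious when $Q(C_1) \notin (0, \frac{1}{8})$; so we assume hereafter
in the proof that $Q(C_1)$ is within that interval. Write $t = \lceil \log_2 n \rceil$, and let $x_1, x_2, \ldots, x_t$ be
the random words that, together with $C_1$, span $C_2$. For $i = 0, 1, 2, \ldots, t$, define the linear
space $\mathcal{L}_i$ iteratively by $\mathcal{L}_0 = C_1$ and

$$\mathcal{L}_i = \mathcal{L}_{i-1} + F x_i .$$

Letting $Q_i$ stand for (the random variable) $Q(\mathcal{L}_i)$ and $\omega_i$ for $2^i/(8 Q(C_1))$, by Lemma 2.3
and Markov's inequality we get for every $i = 1, 2, \ldots, t$ that, conditioned on an instance of
$\mathcal{L}_{i-1}$,

$$\mathrm{Prob}\{ Q_i > Q_{i-1}^2 \omega_i \mid \mathcal{L}_{i-1} \} = \mathrm{Prob}\{ Q(\mathcal{L}_{i-1} + F x_i) > Q_{i-1}^2 \omega_i \mid \mathcal{L}_{i-1} \} \le \frac{1}{\omega_i} = (8 Q(C_1)) \cdot 2^{-i} .$$

Hence, the same bound holds without the conditioning for every $i = 1, 2, \ldots, t$, and by the union bound,

$$\mathrm{Prob}\Big\{ Q_t > Q_0^{2^t} \prod_{i=1}^{t} \omega_i^{2^{t-i}} \Big\} \le \mathrm{Prob}\Big\{ \bigcup_{i=1}^{t} \big( Q_i > Q_{i-1}^2 \omega_i \big) \Big\} \le \sum_{i=1}^{t} \frac{1}{\omega_i} < 8 Q(C_1) .$$

Substituting $Q_0 = Q(C_1)$ and $Q_t = Q(C_2)$, we conclude that

$$\mathrm{Prob}\Big\{ Q(C_2) > (Q(C_1))^{2^t} \prod_{i=1}^{t} \omega_i^{2^{t-i}} \Big\} < 8 Q(C_1) ,$$

where

$$(Q(C_1))^{2^t} \prod_{i=1}^{t} \omega_i^{2^{t-i}} = \Big( Q(C_1) \prod_{i=1}^{t} \omega_i^{2^{-i}} \Big)^{2^t} < \Big( Q(C_1) \prod_{i=1}^{\infty} \omega_i^{2^{-i}} \Big)^{2^t} = \left( Q(C_1) \cdot \frac{2^{\sum_{i=1}^{\infty} i 2^{-i}}}{(8 Q(C_1))^{\sum_{i=1}^{\infty} 2^{-i}}} \right)^{2^t} = 2^{-2^t} \le 2^{-n}$$

(the strict inequality holds since $\omega_i > 1$ for every $i$ when $Q(C_1) < \frac{1}{8}$). The result follows by
recalling that the events "$Q(C_2) \ge 2^{-n}$" and "$Q(C_2) > 0$" are identical.

□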

Figure 1 lists the generator matrices of linear $[n, k, d]$ codes over $F$ that form linear
balancing sets, for several values of $n$ that are divisible by 4. These matrices were found
using a greedy algorithm and they do not necessarily generate the smallest sets, except for
$n = 12$ and $n = 20$, where the sets attain the lower bound of Theorem 2.1 (in addition, for
the case $n = 20$, the set attains the Griesmer bound [10, §17.5]).

Remark 3.2. In view of Remark 3.1, when $n = 2^m$ (or, more generally, when $n$ is "close"
to $2^m$), Theorem 3.1 holds also for the smaller ensemble where we fix $\lceil m/2 \rceil$ basis elements
of the random code $C$ to be linearly independent codewords of the code $C_0(m, \lceil m/2 \rceil)$ defined
in Remark 3.1. Furthermore, if these $\lceil m/2 \rceil$ rows are replaced by $\ell$ basis elements of the
code $C_0'(m, \ell)$ (as defined in that remark), then the value $\beta$ in the proof of Theorem 3.1 can
be taken as $1 - (\pi/4)$ ($\approx 0.215$) whenever $\ell - (m/2)$ goes to infinity (yet more slowly than
$\rho(n)$). □

We leave it open to find an explicit construction of linear balancing sets in $F^n$ of dimension
$O(\log n)$. We also mention the following intractability result.

Theorem 3.5. Given as input a basis of a linear subspace $C$ of $F^n$, the problem of deciding
whether $C$ is a balancing set is NP-hard.

The proof of Theorem 3.5 is obtained by some modification of the reduction in [11] from
Three-Dimensional Matching. We include the proof in Appendix B.

[8,3,3]:
    00001111
    01110010
    10001100

[12,4,5]:
    000000111111
    000111001110
    101001011100
    111100001000

[16,5,7]:
    0000000011111111
    0001111100001110
    0110011101111100
    1101011011001000
    1111111100010000

[20,5,9]:
    00000000001111111111
    00000111110000111110
    01111001110111001100
    11100101101100101000
    11010111010010010000

[24,6,9]:
    000000000000111111111111
    000000011111000000111110
    000111100111001111011100
    001001111011110011111000
    111111110100000000010000
    110101011010100000100000

[28,6,11]:
    0000000000000011111111111111
    0000000111111100000011111110
    0001111000111100111100111100
    0010011011001111001111111000
    1111110110111000000000010000
    1011011101100110000000100000

[32,7,13]:
    00000000000000001111111111111111
    00000000011111110000000011111110
    00000111100000110000011101111100
    01110011001001010101110110101000
    01101110100101011111011010110000
    10110111000111000110111001100000
    10100100100000011101111101000000

Figure 1: Bases of linear balancing sets for $n = 8, 12, 16, \ldots, 32$.



4 Linear almost-balancing sets

While the code $C_0(m, \ell{=}m)$ in Remark 3.1 is such that $Q(C_0(m, m))$ is bounded away from
zero, this code can be seen as "almost balancing" in the following sense: for every word $y \in F^n$
(where $n = 2^m$) there exists a codeword $x \in C_0(m, m)$ such that $|d(y, x) - (n/2)| \le \sqrt{n}/2$.
The proof of this fact is similar to the one showing that the covering radius of the first-order
Reed-Muller code is at most $(n - \sqrt{n})/2$ [6, pp. 241-242] (specifically, in the line following
Eq. (9.2.4) therein, simply reverse the inequality in "$|\langle \cdot, \cdot \rangle| \ge \sqrt{n}$"; see also (11) below).

Next, we formalize the notion of almost balancing sets and present generalizations for
Theorems 2.2 and 3.1. In what follows, we fix some function $\lambda : 2\mathbb{N} \to \mathbb{N}$ such that $\lambda(n) <
n/2$, and write $\lambda = \lambda(n)$ for simplicity. For a word $x \in F^n$ define the set

$$B_\lambda(x) = \{ y \in F^n : |d(y, x) - n/2| \le \lambda \} .$$

As was the case for $\lambda = 0$, the notation $B_\lambda(\cdot)$ can be extended to subsets $C \subseteq F^n$ by

$$B_\lambda(C) = \bigcup_{x \in C} B_\lambda(x) .$$

A subset $C \subseteq F^n$ is called a $\lambda$-almost-balancing set if $B_\lambda(C) = F^n$; equivalently, $C$ is a
$\lambda$-almost-balancing set if for every $y \in F^n$ there exists $x \in C$ such that $|d(y, x) - n/2| \le \lambda$.

The following theorem can be seen as a generalization of Theorem 2.2.

Theorem 4.1. Suppose that $\lambda = \lambda(n) = O(\sqrt{n})$. There exists a linear $\lambda$-almost-balancing
set of dimension $\lceil \frac{3}{2} \log_2 n - \log_2(2\lambda + 1) + O(\lambda^2/n) \rceil$.

Proof. We follow the steps of the proof of Theorem 2.2, with $Q(C_i)$ replaced by a
term $Q_\lambda(C_i)$ which equals $1 - 2^{-n} |B_\lambda(C_i)|$, and with (2) replaced by an upper bound on
$Q_\lambda(C_0) = Q_\lambda(\{0\})$ which we shall now derive.

Let $H : [0, 1] \to [0, 1]$ be the binary entropy function $H(z) = -(z \log_2 z) - (1 - z) \log_2(1 - z)$.
Then,

$$|B_\lambda(0)| = \sum_{j=(n/2)-\lambda}^{(n/2)+\lambda} \binom{n}{j} \ge (2\lambda + 1) \binom{n}{n/2 - \lambda} \ge \frac{2\lambda + 1}{\sqrt{2n (1 - 4 (\lambda/n)^2)}} \cdot 2^{n H(\frac{1}{2} - \frac{\lambda}{n})} \ge \frac{2\lambda + 1}{\sqrt{2n}} \cdot 2^{n H(\frac{1}{2} - \frac{\lambda}{n})} , \qquad (9)$$

where the penultimate step follows from a well known lower bound on binomial coefficients
[10, p. 309]. From (9) we have

$$Q_\lambda(C_0) \le 1 - \frac{2\lambda + 1}{\sqrt{2n}} \cdot 2^{-n \left( 1 - H(\frac{1}{2} - \frac{\lambda}{n}) \right)} ,$$

thereby obtaining the counterpart of (2). Proceeding as in the proof of Theorem 2.2 we see that
$\lceil \frac{3}{2} \log_2 n - \log_2(2\lambda + 1) + n (1 - H(\frac{1}{2} - \frac{\lambda}{n})) \rceil$ basis elements are sufficient to span a linear
$\lambda$-almost-balancing set.

Finally, using the Taylor series expansion for $H(\frac{1}{2} - z)$ and recalling that $\lambda = O(\sqrt{n})$, we
obtain

$$n \left( 1 - H\left( \tfrac{1}{2} - \tfrac{\lambda}{n} \right) \right) = \frac{\lambda^2}{n \ln 2} \, (2 + o(1)) = O\left( \frac{\lambda^2}{n} \right) , \qquad (10)$$

thereby completing the proof. □
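
The leading term in (10) comes from the expansion $1 - H(\frac{1}{2} - z) = \frac{2}{\ln 2} z^2 + O(z^4)$, and it can be compared against the exact quantity numerically. The sketch below is only an illustration; the function names `H` and `entropy_gap` and the sample values of $n$ and $\lambda$ are our own choices.

```python
from math import log, log2

def H(z):
    """Binary entropy function in bits, with H(0) = H(1) = 0."""
    if z in (0.0, 1.0):
        return 0.0
    return -z * log2(z) - (1 - z) * log2(1 - z)

def entropy_gap(n, lam):
    """Return (exact, leading) for n(1 - H(1/2 - lam/n)) and 2 lam^2 / (n ln 2)."""
    exact = n * (1 - H(0.5 - lam / n))
    leading = 2 * lam ** 2 / (n * log(2))
    return exact, leading

for n, lam in ((256, 8), (1024, 16), (4096, 32)):
    exact, leading = entropy_gap(n, lam)
    print(n, lam, round(exact, 6), round(leading, 6))
```

With $\lambda = \sqrt{n}/2$ the leading term equals $1/(2 \ln 2) \approx 0.72$ for every $n$, which illustrates why the correction to the dimension in Theorem 4.1 stays bounded when $\lambda = O(\sqrt{n})$.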

Observe that for $n = 2^m$ and $\lambda = \lfloor \sqrt{n}/2 \rfloor$, the code $C_0(m, m)$ realizes the dimension
guaranteed in Theorem 4.1.

The following theorem is a generalization of Theorem 3.1.

Theorem 4.2. Suppose that $\lambda = \lambda(n) = O(\sqrt{n})$. Given a function $\rho : 2\mathbb{N} \to \mathbb{N}$, let $C$
be a random linear subspace of $F^n$ that is spanned by $\lceil \frac{3}{2} \log_2 n - \log_2(2\lambda + 1) \rceil + \rho(n)$ words
selected independently and uniformly from $F^n$. Then,

$$\mathrm{Prob}\{ C \text{ is a } \lambda\text{-almost-balancing set} \} \ge 1 - \exp(-\rho(n)) .$$

Proof. The proof is the same as that of Theorem 3.1, except that $Q(\cdot)$ is replaced
by $Q_\lambda(\cdot)$ in Lemmas 3.3 and 3.4 (and in their proofs), and Lemma 3.2 is replaced by the
following lemma. □

Lemma 4.3. Suppose that $\lambda = O(\sqrt{n})$, and let $C_0$ be a random linear subspace of $F^n$
which is spanned by $\lceil \frac{1}{2} \log_2 n - \log_2(2\lambda + 1) \rceil$ random words that are selected independently
and uniformly from $F^n$. There exists an absolute constant $\beta \in [0, 1)$ such that

$$\mathrm{Prob}\{ Q_\lambda(C_0) > \beta \} \le \exp(-n) .$$

The proof of Lemma 4.3 can be found in Appendix C.

While Theorems 4.1 and 4.2 only cover the case where $\lambda = O(\sqrt{n})$, we next show that
when $\lambda = \Omega(\sqrt{n})$, it is fairly easy to obtain an explicit construction for linear
$\lambda$-almost-balancing sets with relatively small dimensions. Specifically, let $s$ and $m$ be any two positive
integers, and set $n = s \cdot 2^m$ and $\lambda = \lfloor \sqrt{sn}/2 \rfloor$. The construction described below yields a
linear $\lambda$-almost-balancing set of dimension at most $2 (\log_2 n - \log_2(2\lambda))$.

Given $m$ and $s$, let $C_0 = C_0(m, m)$ be the linear $[M{=}2^m, m, 2^{m-1}]$ code over $F$ as in
Remark 3.1, and let $c_1, c_2, \ldots, c_M$ denote the codewords of $C_0$. It is shown in [6] that for
every word $y \in F^M$,

$$\sum_{i=1}^{M} (M - 2 d(y, c_i))^2 = M^2 \qquad (11)$$

(from which one gets that there exists at least one codeword $c_i \in C_0$ such that
$|(M/2) - d(y, c_i)| \le \sqrt{M}/2$; see the discussion at the beginning of this section).

Consider now the code $C_0^{(s)}$ which consists of the words $x_1, x_2, \ldots, x_M$, where

$$x_i = ( \underbrace{c_i \mid c_i \mid \ldots \mid c_i}_{s \text{ times}} ) , \quad i = 1, 2, \ldots, M .$$

Clearly, $C_0^{(s)}$ is a linear $[n{=}sM, m]$ code over $F$. Given a word $y \in F^n$, we write it as
$(y_1 \mid y_2 \mid \ldots \mid y_s)$ where each block $y_j$ is in $F^M$, and define

$$z_{i,j} = M - 2 d(y_j, c_i) , \quad i = 1, 2, \ldots, M , \quad j = 1, 2, \ldots, s .$$

Obviously,

$$n - 2 d(y, x_i) = \sum_{j=1}^{s} z_{i,j} , \quad i = 1, 2, \ldots, M ,$$

and, so,

$$\sum_{i=1}^{M} (n - 2 d(y, x_i))^2 = \sum_{i=1}^{M} \Big( \sum_{j=1}^{s} z_{i,j} \Big)^2 \le s \sum_{i=1}^{M} \sum_{j=1}^{s} z_{i,j}^2 = s \sum_{j=1}^{s} \sum_{i=1}^{M} z_{i,j}^2 = s^2 M^2 ,$$

where the inequality follows from the convexity of $z \mapsto z^2$ and the last step follows from (11). Hence, there is at least one index
$i \in \{1, 2, \ldots, M\}$ for which

$$|n - 2 d(y, x_i)| \le s \sqrt{M} = \sqrt{sn} .$$

We conclude that $C_0^{(s)}$ is a linear $\lambda$-almost-balancing set with $\lambda = \lfloor \sqrt{sn}/2 \rfloor$, and its dimension
is $m = \log_2(n/s) \le 2 (\log_2 n - \log_2(2\lambda))$.
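
For small parameters, both identity (11) and the almost-balancing property of the repeated code can be verified by exhaustive search. The sketch below is our own illustration: the codewords of $C_0(m, m)$ are generated as evaluations of the linear forms $v \mapsto a \cdot v$ over all $v \in F^m$ (so the fixed zero coordinate appears first rather than last, which is equivalent up to a coordinate permutation), and all function names are assumptions.

```python
from itertools import product
from math import isqrt

def first_order_codewords(m):
    """Codewords of C_0(m, m): the linear forms a.v evaluated at every v in F^m."""
    points = list(product((0, 1), repeat=m))
    return [tuple(sum(ai * vi for ai, vi in zip(a, v)) % 2 for v in points)
            for a in product((0, 1), repeat=m)]

def distance(y, x):
    return sum(a ^ b for a, b in zip(y, x))

def identity_11_holds(m):
    """Check sum_i (M - 2 d(y, c_i))^2 = M^2 for every y in F^M, M = 2^m."""
    code, M = first_order_codewords(m), 2 ** m
    return all(sum((M - 2 * distance(y, c)) ** 2 for c in code) == M * M
               for y in product((0, 1), repeat=M))

def repeated_code(m, s):
    """The code C_0^(s): every codeword of C_0(m, m) repeated s times."""
    return [c * s for c in first_order_codewords(m)]

def max_deviation(code, n):
    """Largest value over y of min over x in code of |d(y, x) - n/2|."""
    return max(min(abs(distance(y, x) - n // 2) for x in code)
               for y in product((0, 1), repeat=n))

m, s = 2, 2
n = s * 2 ** m
lam = isqrt(s * n) // 2        # floor(sqrt(s n) / 2)
print(identity_11_holds(3), max_deviation(repeated_code(m, s), n) <= lam)
```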

We end this section by comparing our results to the following generalization of Theorem 2.1.

Theorem 4.4. [1] The dimension of every linear $\lambda$-almost-balancing set $C \subseteq F^n$ is at
least $\lceil \log_2 n - \log_2(2\lambda + 1) \rceil$.

For $\lambda = O(\sqrt{n})$, there is still an additive gap of approximately $\frac{1}{2} \log_2 n$ between the lower
bound and the upper bound guaranteed by Theorem 4.1, and for $\lambda = \Omega(\sqrt{n})$, the dimension
of $C_0^{(s)}$ is approximately twice the lower bound.



5 Balanced error-correcting codes 

In this section, we consider a potential application of linear balancing sets in designing an
efficient coding scheme that maps information words into balanced words that belong to a
linear error-correcting code; as such, the scheme combines error-correction capabilities with
the balancing property.

The underlying idea is as follows. Let $C$ be a linear $[n, k, d]$ code over $F$ with the length $n$
and minimum distance $d$ chosen so as to satisfy the required correction capabilities. Suppose,
in addition, that we can write $C$ as a direct sum of two linear subspaces $C'$ and $C''$ of
dimensions $k'$ and $k''$, respectively,

$$C = C' \oplus C'' = \{ c + x : c \in C', x \in C'' \} , \qquad (12)$$

where $C''$ is a balancing set.¹ Now, if $k''$ is "small" (which means that $k'$ is close to $k$),
we can encode by first mapping a $k'$-bit information word $u$ into a codeword $c \in C'$, and
then finding a word $x \in C''$ so that $c + x$ is balanced. The transmitted codeword is then
the (balanced) sum $c + x$. The mapping $u \mapsto c$ can be implemented simply as a linear
transformation, whereas the balancing word $x$ can be found by exhaustively searching over
the $2^{k''}$ elements of $C''$. At the receiving end, we apply a decoder for $C$ (for correcting up to
$(d-1)/2$ errors) to a (possibly noisy) received word $c + x + e$, where $e$ is the error word.
Clearly, if $w(e) \le (d-1)/2$, we will be able to recover $c + x$ successfully, thereby retrieving $u$.

Obviously, such a scheme is useful only when $k''$ is indeed small: first, $k''$ affects the effective
rate (given by $k'/n = (k - k'')/n$) and, secondly, the encoding process, as described, is
exponential in $k''$. Yet, there is not always a decomposition of $C$ as in (12) that results in
a small dimension $k''$ of $C''$ (in fact, for some codes $C$, such a decomposition does not exist at
all).

A possible solution would then be to reverse the design process and start by first selecting
the code $C'$ so that it has the desired rate $R = k'/n$ and a "slightly" higher minimum distance
$d'$ than the desired value $d$. In addition, we assume that there is an efficient (i.e.,
polynomial-time) decoding algorithm $\mathcal{D}'$ for $C'$ that corrects any pattern of up to $(d-1)/2$ errors.

Next, we select $C''$ to be a random linear code spanned by $k'' = \lceil \frac{3}{2} \log_2 n \rceil + \rho(n)$ words
that are chosen independently and uniformly from $F^n$, for some function $\rho(n) = o(\log n)$
that grows to infinity. By Theorem 3.1, the code $C''$ will be a balancing set with probability
$1 - \exp(-\rho(n)) = 1 - o(1)$, and the choice of $k''$ guarantees that an exhaustive search for
the balancing word $x$ during encoding will take $O(n^{3/2+\epsilon})$ iterations, for an arbitrarily small
$\epsilon > 0$ (if the search fails, an event that may occur with probability $o(1)$, we can simply
replace the code $C''$). The receiving end can be informed of the choice of the code $C''$ by, say,
using pseudo-randomness instead of randomness (and flagging a skip when failing to find a
balancing word $x$).

It remains to consider the distance properties of the direct sum $C = C' \oplus C''$; specifically,
we need the subset of balanced words in $C$ to have minimum distance at least $d$; in particular,
every balanced word in $C$ should have a unique decomposition of the form $c + x$ where $c \in C'$
and $x \in C''$. When this condition holds, the decoding can proceed as follows. Given a
received word $y \in F^n$, we enumerate over all words $x \in C''$ and then apply the decoder $\mathcal{D}'$
to each difference $y - x$. Decoding will be successful if the number of errors did not exceed
$(d-1)/2$, and the decoding complexity will be $O(n^{3/2+\epsilon})$ times the complexity of $\mathcal{D}'$.
The next lemma considers the case where the code $C'$ lies below the Gilbert-Varshamov
bound. Hereafter, $V(n, t)$ stands for $\sum_{i=0}^{t} \binom{n}{i}$.

Lemma 5.1. Suppose that $C'$ is a linear $[n, k', d']$ code over $F$ that satisfies $2^{k'} \cdot
V(n, d'-1) < 2^n$. For every $d \le d'$, the minimum distance $d(C)$ of (the random code)
$C = C' \oplus C''$ satisfies

$$\mathrm{Prob}\{ d(C) < d \} \le 2^{k'+k''} \cdot \frac{V(n, d-1)}{2^n} < 2^{k''} \cdot \frac{V(n, d-1)}{V(n, d'-1)} .$$

¹ For the scheme to work, it actually suffices that words in $C''$ balance only the elements of $C'$, rather than
all the words in $F^n$.

Proof. The code $C$ contains $|C| - |C'|$ random codewords, each being uniformly distributed
over $F^n$ and therefore each having probability $V(n, d-1)/2^n$ to be of Hamming weight less
than $d$. The result follows from the union bound. □

It is well known (see [10, p. 310]) that for any integer $t = \theta n < n/2$,

$$\frac{2^{n H(\theta)}}{\sqrt{8 n \theta (1 - \theta)}} \le V(n, t) \le 2^{n H(\theta)} ,$$

where $H : [0, 1] \to [0, 1]$ is the binary entropy function defined earlier. Hence, taking $k'' \le
(\frac{3}{2} + \epsilon) \log_2 n$, we get from Lemma 5.1 and the concavity of $z \mapsto H(z)$ that

$$\mathrm{Prob}\{ d(C) < d \} \le \sqrt{2} \cdot n^{2+\epsilon} \left( \frac{d'-1}{n-d'+1} \right)^{d'-d} .$$

Thus, to achieve a vanishing probability, $\mathrm{Prob}\{ d(C) < d \}$, of ending up with a "bad" code
$C$ as $n$ goes to infinity, it suffices to take $d' = d + O(\log n)$ when $d/n$ is fixed and bounded
away from zero, or $d' = d + O(1)$ when $d$ is fixed.

Remark 5.1. Instead of a decoding process whereby we enumerate over the codewords
of $C''$ and then apply the decoder $\mathcal{D}'$, we could use a decoder for the whole direct sum
$C$, if techniques such as iterative decoding are applicable to $C$: in such circumstances, the
advantage of the linearity of $C$ is apparent. Linearity certainly helps if we are interested only
in error detection rather than full correction, in which case the decoding amounts to just
computing a syndrome with respect to any parity-check matrix of $C$. □



Appendices 



A Proof of Lemma 2.3

Proof. We have,

$$|B(C + Fx)| = |B(C \cup (C + x))| = |B(C)| + |B(C + x)| - |B(C) \cap B(C + x)| = 2 |B(C)| - |B(C) \cap (B(C) + x)| .$$

Hence,

$$\sum_{x \in F^n} |B(C + Fx)| = 2^{n+1} |B(C)| - \sum_{x \in F^n} |B(C) \cap (B(C) + x)| .$$

Now,

$$\sum_{x \in F^n} |B(C) \cap (B(C) + x)| = |\{ (x, y) : x \in F^n, \ y \in B(C), \ y \in B(C) + x \}| = |\{ (x, y) : x \in F^n, \ y \in B(C), \ x \in B(C) + y \}| = |B(C)|^2 .$$

Therefore,

$$\sum_{x \in F^n} |B(C + Fx)| = 2^{n+1} |B(C)| - |B(C)|^2 .$$

Using the definition of $Q(\cdot)$ the lemma is proved. □



B Proof of Theorem 3.5

We prove Theorem 3.5 below, starting by recalling the reduction that is used in [11] to show
the intractability of computing the covering radius of a linear code.

Let $\mathcal{G} = (V_1 : V_2 : V_3, E)$ be a tripartite hyper-graph with a vertex set which is the union of
the disjoint sets $V_1$, $V_2$, and $V_3$ of the same size $t$, and a hyper-edge set $E = \{e_1, e_2, \ldots, e_m\} \subseteq
V_1 \times V_2 \times V_3$.

The reduction in [11] maps $\mathcal{G}$ into a $3t \times 8m$ parity-check matrix $H = H_{\mathcal{G}} = (H_e)_{e \in E}$,
where each block $H_e$ is a $3t \times 8$ matrix over $F$ whose rows and columns are indexed by
$u \in V_1 \cup V_2 \cup V_3$ and $(a_1 \, a_2 \, a_3) \in F^3$, respectively, and is computed from the hyper-edge
$e = (v_{e,1}, v_{e,2}, v_{e,3})$ as follows:

$$(H_e)_{u, (a_1 a_2 a_3)} = \begin{cases} a_\ell & \text{if } u = v_{e,\ell} \\ 0 & \text{if } u \ne v_{e,\ell} \text{ for } \ell = 1, 2, 3 . \end{cases}$$

(Namely, the three nonzero rows in $H_e$ are indexed by the vertices that are incident with the
hyper-edge $e$, and these rows form a $3 \times 8$ matrix whose columns range over all the elements
of $F^3$.)

A matching in $\mathcal{G}$ is a subset $\mathcal{M} \subseteq E$ of size $t$ such that no two hyper-edges in $\mathcal{M}$ are
incident with the same vertex (thus, every vertex of $\mathcal{G}$ is incident with exactly one hyper-edge
in $\mathcal{M}$).

For our purposes, we can assume that every vertex in $\mathcal{G}$ is incident with at least one
hyper-edge (or else no matching exists). Under these conditions, $m \ge t$ and the matrix $H$
has full rank (since it contains the identity matrix of order $3t$).

The proof in [11] is based on the following two facts:

(i) There is a matching $\mathcal{M}$ in $\mathcal{G}$ if and only if the all-one column vector $\mathbf{1}$ in $F^{3t}$ can be
written as a sum of (exactly) $t$ columns of $H$ (note that $\mathbf{1}$ cannot be the sum of fewer
than $t$ columns). Those columns then must be those that are indexed by $(1 \, 1 \, 1)$ in all
blocks $H_e$ such that $e \in \mathcal{M}$.

(ii) If $\mathcal{M}$ is a matching in $\mathcal{G}$ then every column vector in $F^{3t}$ can be written as a sum
$\sum_{e \in \mathcal{M}} h_e$ where each $h_e$ is a column in $H_e$.

Let $C = C_{\mathcal{G}}$ be the linear $[8m, 8m - 3t]$ code over $F$ with a parity-check matrix $H$. It
readily follows from facts (i) and (ii) that $\mathcal{G}$ has a matching if and only if every coset of $C$
within $F^{8m}$ has a word of Hamming weight $t$.

From facts (i)-(ii) we get the following lemma.

Lemma B.1. Suppose that $t > 1$ and that $\mathcal{G}$ contains a matching. Then every column
vector in $F^{3t}$ can be obtained as a sum of $w$ distinct columns in $H$, for every $w$ in the range
$t \le w \le 8m - t$.

Proof. Let $\mathcal{M}$ be a matching which is assumed to exist in $\mathcal{G}$. Given $w \in
\{t, t+1, \ldots, 8m - t\}$, write

$$\sigma = \min\{ 8(m - t), \, w - t \} ,$$

and let $x$ be a column vector in $F^{3t}$ which is the sum of $\sigma$ columns in $H$ that do not belong
to the $t$ blocks $H_e$ that correspond to $e \in \mathcal{M}$. Also, write

$$r = w - \sigma = \begin{cases} t & \text{if } w \le 8m - 7t \\ w - 8(m - t) & \text{otherwise} \end{cases}$$

and note that $t \le r \le 7t$.

Given an arbitrary column vector $s \in F^{3t}$, we show that there are $w$ distinct columns in
$H$ that sum to $s$. By fact (ii), for every $e \in \mathcal{M}$ there is a column $h_e$ in $H_e$ such that

$$s = x + \sum_{e \in \mathcal{M}} h_e . \qquad (13)$$

Furthermore, it follows from the structure of each block $H_e$ that when $h_e \ne 0$, then for every
integer $r$ in the range $1 \le r \le 7$ there exist $r$ distinct columns $h_{e,1}, h_{e,2}, \ldots, h_{e,r}$ in $H_e$ such
that

$$h_e = \sum_{j=1}^{r} h_{e,j} .$$

The same holds also when $h_e = 0$ for values of $r$ in $\{0, 1, 3, 4, 5, 7, 8\}$.
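
The two claims about sums of distinct columns within a block depend only on the eight elements of $F^3$ and can be confirmed by enumeration. The sketch below is our own illustration; it encodes the elements of $F^3$ as the integers $0, \ldots, 7$ with XOR as addition, and the function name `feasible_counts` is an assumption.

```python
from functools import reduce
from itertools import combinations

def feasible_counts(target):
    """Set of r for which some r distinct elements of F^3 sum (XOR) to target."""
    result = set()
    for r in range(9):
        for subset in combinations(range(8), r):
            if reduce(lambda a, b: a ^ b, subset, 0) == target:
                result.add(r)
                break
    return result

print(sorted(feasible_counts(0)))   # counts available when h_e = 0
print(sorted(feasible_counts(5)))   # counts available for a nonzero h_e
```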

We conclude that we can find $t$ nonnegative integers $(r_e)_{e \in \mathcal{M}}$ such that the following two
conditions hold:

• $\sum_{e \in \mathcal{M}} r_e = r$ ($\in \{t, t+1, \ldots, 7t\}$), and

• For each $e \in \mathcal{M}$, the column vector $h_e$ can be written as a sum of (exactly) $r_e$ distinct
columns of $H_e$.

Thus, the right-hand side of (13) can be expressed as a sum of $\sigma + \sum_{e \in \mathcal{M}} r_e = \sigma + r = w$
distinct columns in $H$. □

Proof of Theorem 3.5. Given a hyper-graph $\mathcal{G}$, consider the linear $[16m - 2t, 8m - 3t]$
code $C'_{\mathcal{G}}$ over $F$ with an $(8m + t) \times (16m - 2t)$ parity-check matrix

$$H' = H'_{\mathcal{G}} = \begin{pmatrix} 0 & I \\ H & 0 \end{pmatrix} ,$$

where $H = H_{\mathcal{G}}$ and $I$ is the identity matrix of order $8m - 2t$. Next, we show that there is
a matching in $\mathcal{G}$ if and only if every coset of $C'_{\mathcal{G}}$ contains a balanced word (i.e., a word of
Hamming weight $8m - t$).

Suppose that $\mathcal{G}$ contains a matching $\mathcal{M}$. We show that every column vector $s \in F^{8m+t}$
can be expressed as a sum of (exactly) $8m - t$ distinct columns of $H'$. Write $s^T = (s_1^T \mid s_2^T)$,
where $s_1$ consists of the first $8m - 2t$ entries of $s$ (and $s_2$ consists of the remaining $3t$ entries).
By Lemma B.1, there exist $w = 8m - t - w(s_1)$ distinct columns in $H$ that sum to $s_2$. Hence,
by the structure of $H'$ it follows that $H'$ contains $w + w(s_1) = 8m - t$ columns that sum to $s$.

Conversely, suppose that every coset of $C'_{\mathcal{G}}$ contains a balanced word. In particular, this
means that the all-one vector in $F^{8m+t}$ can be expressed as a sum of $8m - t$ columns of $H'$.
Now, the last $8m - 2t$ columns of $H'$ must be included in this sum; this, in turn, implies that
the all-one vector $\mathbf{1}$ in $F^{3t}$ can be written as a sum of $t$ columns of $H$. The result follows
from fact (i). □



C Proof of Lemma 4.3 



Proof. We will follow along the steps of the proof of Lemma 3.2, except that (6) needs to
be replaced by a different upper bound which we now derive. Given some $\delta \in [0, \frac{1}{2})$, let $x$
and $x'$ be two distinct words in $C_0$ with $d(x, x') = \tau n$ where $\frac{1}{2} - \delta \le \tau \le \frac{1}{2} + \delta$. The number
of words $y \in F^n$ such that $d(x, y) = i$ and $d(x', y) = j$ is given by

$$p_{i,j}^{(\tau n)} = \binom{\tau n}{(j - i + \tau n)/2} \binom{(1 - \tau) n}{(i + j - \tau n)/2}$$

(here we assume that the binomial coefficient $\binom{m}{k}$ is equal to $0$ unless $m$ and $k$ are both
nonnegative integers and $m \ge k$). Hence,

$$|B_\lambda(x) \cap B_\lambda(x')| \le \sum_{i=(n/2)-\lambda}^{(n/2)+\lambda} \ \sum_{j=(n/2)-\lambda}^{(n/2)+\lambda} p_{i,j}^{(\tau n)} .$$

It can be easily verified that when $\tau n$ is even then

$$\max_{i,j} p_{i,j}^{(\tau n)} = \binom{\tau n}{\tau n/2} \binom{(1 - \tau) n}{(1 - \tau) n/2} \le \frac{2^{\tau n}}{\sqrt{\pi \tau n/2}} \cdot \frac{2^{(1 - \tau) n}}{\sqrt{\pi (1 - \tau) n/2}} ,$$

and when $\tau n$ is odd then

$$\max_{i,j} p_{i,j}^{(\tau n)} = \binom{\tau n}{(\tau n + 1)/2} \binom{(1 - \tau) n}{((1 - \tau) n + 1)/2} = \frac{1}{4} \binom{\tau n + 1}{(\tau n + 1)/2} \binom{(1 - \tau) n + 1}{((1 - \tau) n + 1)/2} \le \frac{2^{\tau n}}{\sqrt{\pi \tau n/2}} \cdot \frac{2^{(1 - \tau) n}}{\sqrt{\pi (1 - \tau) n/2}} ,$$

where in both cases the inequality follows from the upper bound in (1). In either case we have:

$$|B_\lambda(x) \cap B_\lambda(x')| \le (2\lambda + 1)^2 \max_{i,j} p_{i,j}^{(\tau n)} \le (2\lambda + 1)^2 \cdot \frac{2^{n+1}}{\pi n \sqrt{\tau (1 - \tau)}} \le (2\lambda + 1)^2 \cdot \frac{2^{n+2}}{\pi n \sqrt{1 - 4 \delta^2}} . \qquad (14)$$

In addition, from (9) and (10) we get:

$$|B_\lambda(x)| \ge \frac{2\lambda + 1}{\sqrt{2n}} \cdot 2^{n - O(\lambda^2/n)} . \qquad (15)$$

We now proceed as in the proof of Lemma 3.2, with (14) replacing (6) and with (15) replacing
the lower bound in (1): by de Caen's lower bound [4] we get a bound which is similar to (7),
in which we plug $\ell = \lceil \frac{1}{2} \log_2 n - \log_2(2\lambda + 1) \rceil$. The result follows. □